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Abstract 

We propose a novel unsupervised extractive approach for summarizing online reviews by ex¬ 
ploiting review helpfulness ratings. In addition to using the helpfulness ratings for review-level 
filtering, we suggest using them as the supervision of a topic model for sentence-level content 
scoring. The proposed method is metadata-driven, requiring no human annotation, and generaliz- 
able to different kinds of online reviews. Our experiment based on a widely used multi-document 
summarization framework shows that our helpfulness-guided review summarize is significantly 
outperform a traditional content-based summarizer in both human evaluation and automated eval¬ 
uation. 

1 Introduction 

Multi-document summarization has great potential in online reviews, as manually reading comments 
provided by other users is time consuming if not impossible. While extractive techniques arc generally 
preferred over abstractive ones (as abstraction can introduce disfluency), existing extractive summarizes 
arc either supervised or based on heuristics of certain desired characteristics of the summarization result 
(e.g., maximize n-gram coverage (Nenkova and Vanderwende, 2005), etc.). However, when it comes 
to online reviews, there arc problems with both approaches: the first one requires manual annotation 
and is thus less generalizable; the second one might not capture the salient information in reviews from 
different domains (camera reviews vs. movie reviews), because the heuristics arc designed for traditional 
genres (e.g., news articles) while the utility of reviews might vary with the review domain. 

We propose to exploit review metadata, that is review helpfulness ratings *, to facilitate review sum¬ 
marization. Because this is user-provided feedback on review helpfulness which naturally reflects users’ 
interest in online review exploration, our approach captures domain-dependent salient information adap¬ 
tively. Furthermore, as this metadata is widely available online (e.g., Amazon.com, IMDB.com) 2 , our 
approach is unsupervised in the sense that no manual annotation is needed for summarization purposes. 
Therefore, we hypothesize that summarizes guided by review helpfulness will outperform systems based 
on textual features/heuristics designed for traditional genres. To build such helpfulness-guided summa¬ 
rizes, we introduce review helpfulness during content selection in two ways: 1) using the review-level 
helpfulness ratings directly to filter out unhelpful reviews, 2) using sentence-level helpfulness features 
derived from review-level helpfulness ratings for sentence scoring. As we observe in our pilot study 
that supervised LDA (sLDA) (Blei and McAuliffe, 2010) trained with review helpfulness ratings has 
potential in differentiating review helpfulness at the sentence level, we develop features based on the 
inferred hidden topics from sLDA to capture the helpfulness of a review sentence for summarization pur¬ 
poses. We implement our helpfulness-guided review summarizes based on an widely used open-source 
multi-document extractive summarization framework (MEAD (Radev et al., 2004)). Both human and 
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'This is the percentage of readers who found the review to be helpful (Kim et al., 2006). 

2 If it is not available, the review helpfulness can be assessed fully automatically (Kim et al., 2006; Liu et al., 2008). 
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automated evaluations show that our helpfulness-guided summarizes outperform a strong baseline that 
MEAD provides across multiple review domains. Further analysis on the human summaries shows that 
some effective heuristics proposed for traditional genres might not work well for online reviews, which 
indirectly supports our use of review metadata as supervision. The presented work also extrinsically 
demonstrates that the helpfulness-related topics learned from the review-level supervision can capture 
review helpfulness at the sentence-level. 

2 Related Work 

In multi-document extractive summarization, various unsupervised approaches have been proposed to 
avoid manual annotation. A key task in extractive summarization is to identify important text units. 
Prior successful extractive summarizes score a sentence based on n-grams within the sentence: by the 
word frequency (Nenkova and Vanderwende, 2005), bigram coverage (Gillick and Favre, 2009), topic 
signatures (Fin and Hovy, 2000) or latent topic distribution of the sentence (Haghighi and Vanderwende, 
2009), which all aim to capture the “core” content of the text input. Other approaches regal'd the n- 
gram distribution difference (e.g., Kullback-Fieber (KF) divergence) between the input documents and 
the summary (Fin et al., 2006), or based on a graph-representation of the document content (Erkan and 
Radev, 2004; Feskovecl3 et ah, 2005), with an implicit goal to maximize the output representativeness. 
In comparison, while our approach follows the same extractive summarization paradigm, it is metadata 
driven, identifying important text units through the guidance of user-provided review helpfulness assess¬ 
ment. 

When it comes to online reviews, the desired characteristics of a review summary are different from 
traditional text genres (e.g., news articles), and could vary from one review domain to another. Thus 
different review summarizes have been proposed to focus on different desired properties of review sum¬ 
maries, primarily based on opinion mining and sentiment analysis (Carenini et al., 2006; Ferman et al., 
2009; Ferman and McDonald, 2009; Kim and Zhai, 2009). Here the desired property varies from the 
coverage of product aspects (Carenini et al., 2006; Ferman et al., 2009) to the degree of agreement on 
aspect-specific sentiment (Ferman et al., 2009; Ferman and McDonald, 2009; Kim and Zhai, 2009). 
While there is a large overlap between text summarization and review opinion mining, most work fo¬ 
cuses on sentiment-oriented aspect extraction and the output is usually a set of topics words plus their 
representative text units (Hu and Fiu, 2004; Zhuang et ah, 2006). However, such a topic-based summa¬ 
rization framework is beyond the focus of our work, as we aim to adapt traditional extractive techniques 
to the review domain by introducing review helpfulness ratings as guidance. 

In this paper, we utilize review helpfulness via using sFDA. The idea of using sFDA in text summa¬ 
rization is not new. However, the model is previously applied at the sentence level (Fi and Fi, 2012), 
which requires human annotation on the sentence importance. In comparison, our use of sFDA is at 
the document (review) level, using existing metadata of the document (review helpfulness ratings) as the 
supervision, and thus requiring no annotation at all. With respect to the use of review helpfulness ratings, 
early work of review summarization (Fiu et ah, 2007) only consider it as a filtering criteria during input 
preprocessing. Other researchers use it as the gold-standard for automated review helpfulness prediction, 
a predictor of product sales (Ghose and Ipeirotis, 2011), a measurement of reviewers’ authority in social 
network analysis (Fu et ah, 2010), etc. 

3 Helpfulness features for sentence scoring 

While the most straightforward way to utilize review helpfulness for content selection is through filter¬ 
ing (Fiu et ah, 2007) (further discussed in Section 4.3), we also propose to take into account review 
helpfulness during sentence scoring by learning helpfulness-related review topics in advance. Because 
sFDA learns the utility of the topics for predicting review-level helpfulness ratings (decomposing review 
helpfulness ratings by topics), we develop novel features (rHelpSum and sHelpSum) based on the in¬ 
ferred topics of the words in a sentence to capture its helpfulness in various perspectives. We later use 
them for sentence scoring in a helpfulness-guided summarizer (Section 4.3). 

Compared with FDA (Blei et ah, 2003), sFDA (Blei and McAuliffe, 2010) introduces a response 
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variable ip G Y to each document D, during topic discovery. The model not only learns the topic 
assignment z\ : n for words w\ : n in Di, it also learns a function from the posterior distribution of 2 in D 
to Y. When Y is the review-level helpfulness gold-standard, the model leans a set of topics predictive of 
review helpfulness, as well as the utility of z in predicting review helpfulness y,. denoted as //. (Both z 
and rj are K-dimensional.) 

At each inference step, sLDA assigns a topic ID to each word in every review, z; = k means that 
the topic ID for word at position l in sentence s is k. Given the topic assignments z\ : l to words w\-l 
in a review sentence s, we estimate the contribution of s to the helpfulness of the review it belongs to 
(Formula 1), as well as the average topic importance in s (Formula 2). While rHelpSurn is sensitive to 
the review length, sHelpSum is sensitive to the sentence length. 
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As the topic assignment in each inference iteration might not be the same, Riedl and Biemann (Riedl 
and Biemann, 2012) proposed the mode method in their application of LDA for text segmentation - use 
the most frequently assigned topic for each word in all iterations as the final topic assignment - to address 
the instability issue. Inspired by their idea, we also use the mode method to infer the topic assignment in 
our task, but only apply the mode method to the last 10 iterations, because the topic distribution might 
not be well learned at the beginning. 


4 Experimental setup 

To investigate the utility of exploiting user-provided review helpfulness ratings for content selection in 
extractive summarization, we develop two helpfulness-guided summarizes based on the MEAD frame¬ 
work (HelpfulFilter and HelpfulSum). We compare our systems’ performance against a strong unsu¬ 
pervised extractive summarizer that MEAD supports as our baseline (MEAD+LexRank). To focus 
on sentence scoring only, we use the same MEAD word-based MMR (Maximal Marginal Relevance) 
reranker (Carbone11 and Goldstein, 1998) for all summarizes, and set the length of the output to be 200 
words. 


4.1 Data 

Our data consists of two kinds of online reviews: 4050 Amazon camera reviews provided by Jindal and 
Liu (2008) and 280 IMDB movie reviews that we collected by ourselves. Both corpora were used in 
our prior work of automatically predicting review helpfulness, in which every review has at least three 
helpfulness votes. On average, the helpfulness of camera reviews is .80 and that of movie reviews is .74. 

Summarization test sets. Because the proposed approach method is purely unsupervised, and we do 
not optimize our summarization parameters during learning, we evaluate our approach based on a subset 
of review items directly: we randomly sample 18 reviews for each review item (a camera or movie) and 
randomly select 3 items for each review domain. In total there arc 6 summarization test sets (3 items x 
2 domains), where each contains 18 reviews to be summarized (i.e. “summarizing 18 camera reviews 
for Nikon D3200”). In the summarization test sets, the average number of sentences per review is 9 
for camera reviews, and 18 for movie reviews; the average number of words per sentence in the camera 
reviews and movie reviews arc 25 and 27, respectively. 

4.2 sLDA training 

We implement sLDA based on the topic modeling framework of Mallet (McCallum, 2002) using 20 
topics (K = 20) and the best hyper-parameters (topic distribution priors a and word distribution priors 
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(3) that we learned in our pilot study on LDA. 3 

Since our summarization approach is unsupervised, we learn the topic assignment for each review 
word using the corresponding sLDA model trained on all reviews of that domain (4050 reviews for 
camera and 280 reviews for movie). 4 

4.3 Three summarizers 

Baseline (MEAD+LexRank): The default feature set of MEAD includes Position, Length, and Centroid. 
Here Length is a word-count threshold, which gives score 0 to sentences shorter than the threshold. As 
we observe that short review sentences sometimes can be very informative as well (e.g., “This camera is 
so amazing!”, “The best film I have ever seen!”), we adjust Length to 5 from its default value 9. MEAD 
also provides scripts to compute LexRank (Erkan and Radev, 2004), which is a more advanced feature 
using graph-based algorithm for computing relative importance of textual units. We supplement the 
default feature set with LexRank to get the best summarizer from MEAD, yielding the sentence scoring 
function Ff- )ase i me (s), in which s is a given sentence and all features are assigned equal weights (same as 
in the other two summarizers). 


Ft 


baseline 


(s) 


Position + Centroid + LexRank if Length > 5 
0 if Length < 5 


(3) 


HelpfulFilter: This summarizer is a direct extension of the baseline, which considers review-level help¬ 
fulness ratings ( hRating ) as an additional filtering criteria in its sentence scoring function F HelpfulFilter ■ 
(In our study, we omit the automated prediction (Kim et al., 2006; Liu et al., 2008) and filter reviews 
by their helpfulness gold-standard directly.) We set the cutting threshold to be the average helpful¬ 
ness rating of all the reviews that we used to train the topic model for the corresponding domain 
(hRating Ave(domain)). 


Fh elpful Filter ( ® ) 


Fbaseiine(s) if hRating(s) > hRatingAve(domain) 
0 if hRating(s) < hRatingAve(domain) 


(4) 


HelpfulSum: To isolate the contribution of review helpfulness, the second summarizer only uses help¬ 
fulness related features in its sentence scoring function FnelpfulSum■ The features are rHelpSum - the 
contribution of a sentence to the overall helpfulness of its corresponding review, sHelpSum - the average 
topic weight in a sentence for predicting the overall helpfulness of the review (Formula 1 and 2), plus 
hRating for filtering. Note that there is no overlap between features used in the baseline and Helpful- 
Sum, as we wonder if the helpfulness information alone is good enough for discovering salient review 
sentences. 


Fh elpf id S um{ ) 


rHelpSum(s) + sHelpSum(s) if hRating(s) > hRatingAve(domain) 

0 if hRating(s) < hRatingAve(domain) 

(5) 


5 Evaluation 

For evaluation, we will first present our human evaluation user study and then present the automated 
evaluation result based on human summaries collected from the user study. 

3 In our pilot study, we experimented with various hyper-parameter settings, and trained the model with 100 sampling 
iterations in both the Estimation and the Maximization steps. As we found the best results are more likely to be achieved when 
a = 0.5, (3 = 0.1, we use this setting to train the sLDA model in our summarization experiment. 

4 In practice, this means that we need to (re)train the topic model after given the summarization test set. 
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5.1 Human evaluation 

The goal of our human evaluation is to compare the effectiveness of 1) using a traditional content selec¬ 
tion method (MEAD+LexRank), 2) using the traditional method enhanced by review-level helpfulness 
filtering (HelpfulFilter), and 3) using sentence helpfulness features estimated by sLDA plus review-level 
helpfulness filtering (HelpfulSum) for building an extractive multi-document summarization system for 
online reviews. Therefore, we use a within-subject design in our user study for each review domain, 
considering the summarizer as the main effect on human evaluation results. 

The user study is earned out in the form of online surveys (one survey per domain) hosted by Quadrics. 
In total, 36 valid users participated in our online-surveys. 5 We randomly assigned 18 of them to the 
camera reviews, and the rest 18 to the movie reviews. 

5.1.1 Experimental procedures 

Each online survey contains three summarization sets. The human evaluation on each one is taken in 
three steps: 

Step 1 : We first require users to perform manual summarization, by selecting 10 sentences from the 
input reviews (displayed in random order for each visit). This ensures that users are familiar with the 
input text so that they can have fair judgement on machine-generated results. To help users select the 
sentences, we provide an introductory scenario at the beginning of the survey to illustrate the potential 
application in accordance with the domain (e.g., Figure 1). 


Neither Agree nor 

Strongly Disagree Disagree Disagree Agree Strongly Agn 


Scenario 


Imagine that you want to buy a new camera (or a camera lens, flashlight etc.). Now you are reading 
its online reviews Ion Amazon.com) to find out whether you should buy it 

To facilitate you digesting the product reviews, we summarize them with three different 
summarizers, aiming to extract the essence of the reviews. In this survey, vou will compare the 
summaries generated by the systems regarding how helpful/informative they are for vou to make a 

buying decision. 


Figure 1: Scenario for summarizing camera reviews 


1 2 3 4 5 


It covers ALL 

information 1 would Hie 
to include. 



It contains NO 
information that 1 would 
NOT have included. 



It reflects the ideas of 





Figure 2: Content evaluation 


Step 2: We then ask users to perform pairwise comparison on summaries generated by the three sys¬ 
tems. The three pairs are generated in random order; and the left-or-right display position (in Figure 3) 
of the two summaries in each pair is also randomly selected. Here we use the same 5-level preference 
ratings used in (Herman et ah, 2009), and translate them into integers from -2 to 2 in our result analysis. 
Step 3: Finally, we ask users to evaluate the three summaries in isolation regarding the summary quality 
in three content-related aspects: recall, precision and accuracy (top, middle and bottom in Figure 2, 
respectively), which were used in (Carenini et al., 2006). In this content evaluation, the three summaries 
are randomly visited and the users rate the proposed statements (one for each aspect) on a 5-point scale. 

5.1.2 Results 

Pairwise comparison. We use a mixed linear model to analyze user preference over the three summary 
pairs separately, in which “summarizer” is a between-subject factor, “review item” is the repeated factor, 
and “user” is a random effect. Results are summarized in Table 1. (Positive preference ratings on “A 
over B” means A is preferred over B; negative ratings means B is preferred over A.) As we can see, 
HelpfulSum is the best: it is consistently preferred over the other two summarizers across domains and 
the preference is significant throughout conditions except when compared with HelpfulFilter on movie 
reviews. HelpfulFilter is significantly preferred over the baseline (MEAD+FexRank) for movie reviews, 
while it does not outperform the baseline on camera reviews. A further look at the compression rate 
(cRate) of the three systems (Table 2) shows that on average HelpfulFilter generates shortest summaries 

5 All participants are older than eighteen, recruited via university mailing lists, on-campus flyers as well as social networks 
online. While we also considered educational peer reviews as a third domain, about half of the participants dropped out in the 
middle of the survey. Thus we only consider the two e-commerce domains in this paper. 
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Here are two summaries about the set of reviews you just read. Which one of them is more helpful/informative? 


Summary A 

[1] The 6.3 MP is a significant improvement over 
5 and the ability to take photos in manual mode 
can not be understated. 

[2] The Rebel makes use of Compact Flash - the 
oldest, yet still the best technology for taking fast, 
high-quality photos in digital cameras. 

[3] I'm using a Canon L USM lens, and have 
gotten some terrific shots. 

[4] The problem I get into is focusing speed and 
zones. 

[5] You can use a high-end consumer body like 
this one, use a professional piece of Canon glass 
(lens) and take excellent photos. 

[6] I f you're taking portraits, not a problem. 

[7] The sky is the limit. 

[8] Get a good wide-angle and a good, fast 
telephoto and you've got yourself set for some 
great shots. 

[9] For sunny days and outdoor shots, this 
camera is a sheer joy to use 

[10] This camera has plenty of features available, 
or you can just set it for "auto" and that works 
fine, too. 

[11] BTW, the Tamron AF 70-300 Macro 1:2 
zoom is a nice lens to buy with it, and priced 
nicely at Beach Camera 


Summary B 

[1] The "shutter lag" (between when you depress 
the button and when the camera actually takes a 
photo) is fairly well non-existent and the only real 
lag I have is when I am shooting multiple photos 
in RAW format. 

[2] I have had this camera since Jan '05 and have 
so far taken approximately 10,000 shots while on 
trips and at weddings. 

[3] You can take up to 3fps very easily, but if you 
click-click-click the shutter, it doesn't matter if 
Bigfoot the Loch Ness Monster and Elvis start 
doing a little soft-shoe right in front of you, by the 
time the Rebel finishes writing the recent 3 quick 
shots to the CF card, the shot of the century has 
already slithered back into the swamp by the time 
the camera is ready to be used again. 

[4] I f you are primarily an amateur who is either 
used to using a Point and Shoot (film or digital) or 
who has used a consumer film SLR, you 'll find 
this camera easy to operate and use to the extent 
you used your other cameras. 

[5] There are already tons of reviews on the EOS 
300d (Digital Rebel) but I do want to share my 
experience with the camera, so I will keep this 
short. 


Strongly prefeired A slightly preferred A 

o o 


no preference slightly preferred B strongly preferred B 

o o o 


Figure 3: Example of pairwise comparison for summarizing camera reviews (left:FIelpfulSum, right: the 
baseline). 


among the three summarizers on camera reviews 6 , which makes it naturally harder for HelpfulFilter to 
beat the other two (Napoles et al., 2011). 


Pair 

Domain 

Est. Mean 

Sig. 

HelpfulFilter 

Camera 

-.602 

.001 

over MEAD+LexRank 

Movie 

.621 

.000 

HelpfulSum 

Camera 

.424 

.011 

over MEAD+LexRank 

Movie 

.601 

.000 

HelpfulSum 

Camera 

1.18 

.000 

over HelpfulFilter 

Movie 

.160 

.310 


Summarizer 

Camera 

Movie 

MEAD+LexRank 

6.07% 

2.64% 

HelpfulFilter 

3.25% 

2.39% 

HelpfulSum 

5.94% 

2.69% 

Human (Ave.) 

6.11% 

2.94% 


Table 1: Mixed-model analysis of user preference ratings Table 2: Compression rate of the three 
in pairwise comparison across domains. Confidence inter- systems across domains, 
val = 95%. The preference rating is ranged from -2 to 2. 


Content evaluation. We summarize the average quality ratings (Figure 2) received by each summarizer 
across review items and users for each review domain in Table 3. We carry out paired T-tests for every 
pair of summarizers on each quality metric. While no significant difference is found among the three 
summarizers on any quality metric for movie reviews, there are differences for camera reviews. In terms 
of both accuracy and recall, HelpfulSum is significantly better than HelpfulFilter (p=.008 for accuracy, 
p=.034 for recall) and the baseline is significantly better than HelpfulFilter (p=.005 for accuracy, p=.005 
for recall), but there is no difference between HelpfulSum and the baseline. For precision, no significant 

6 While we limit the summarization output to be 200 words in MEAD, as the content selection is at the sentence level, the 
summaries can have different number of words in practice. Considering that word-based MMR controls the redundancy in the 
selected summary sentences (A = 0.5 as suggested), there might be enough content to select using FneipfuiFuter- 
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difference is observed in either domain. 


Summarizer 

Camera 

Movie 

Metric 

Precision 

Recall 

Accuracy 

Precision 

Recall 

Accuracy 

MEAD+LexRank 

2.63 

3.24 

3.57 

2.50 

2.59 

2.93 

HelpfulFilter 

2.78 

2.74 

3.11 

2.44 

2.61 

2.96 

HelpfulSum 

2.41 

3.19 

3.69 

2.52 

2.67 

3.02 


Table 3: Human ratings for content evaluation. The best result on each metric is bolded for every review 
domain (the higher the better). 


With respect to pairwise evaluation, content evaluation yields consistent results on camera reviews 
between HelpfulFilter vs. the baseline and HelpfulSum vs. HelpfulFilter. However, only pairwise com¬ 
parison (preference ratings) shows significant difference between HelpfulSum vs. the baseline and the 
difference in the summarizers’ performance on movie reviews. This confirms that pairwise comparison 
is more suitable than content evaluation for human evaluation (Lerman et al., 2009). 

5.2 Automated evaluation based on ROUGE metrics 

Although human evaluation is generally preferred over automated metrics for summarization evaluation, 
we report our automated evaluation results based on ROUGE scores (Lin, 2004) using references col¬ 
lected from the user study. For each summarization test set, we have 3 machine generated summaries 
and 18 human summaries. We compute the ROUGE scores in a leave-1-out fashion: for each machine 
generated summary, we compare it against 17 out of the 18 human summaries and report the score aver¬ 
age across the 17 runs; for each human summary, we compute the score using the other 17 as references, 
and report the average human summarization performance. 

Evaluation results arc summarized in Table 4 and Table 5, in which we report the F-measure for R- 
1 (unigram), R-2 (bigram) and R-SU4 (skip-bigram with maximum gap length of 4) 7 , following the 
convention in the summarization community. Here we observe slightly different results with respect 
to human evaluation: for camera reviews, no significant result is observed, while HelpfulSum achieves 
the best R-l score and HelpfulFilter works best regarding R-2 and R-SU4. In both cases the baseline 
is never the best. For movie reviews, HelpfulSum significantly outperforms the other summarizers on 
all ROUGE measurements, and the improvement is over 100% on R-2 and R-SU4, almost the same as 
human does. This is consistent with the result of pairwise comparison in that HelpfulSum works better 
than both HelpfulFilter and the baseline on movie reviews. 


Summarizer 

R-l 

R-2 

R-SU4 

MEAD+FexRank 

.333 

.117 

.110 

HelpfulFilter 

.346 

.121 

.111 

HelpfulSum 

.350 

.110 

.101 

Human 

.360 

.138 

.126 


Table 4: ROUGE evaluation on camera reviews 


Summarizer 

R-l 

R-2 

R-SU4 

MEAD+FexRank 

.281 

.044 

.047 

HelpfulFilter 

.273 

.040 

.041 

HelpfulSum 

.325 

.095 

.090 

Human 

.339 

.093 

.093 


Table 5: ROUGE evaluation on movie reviews 


6 Human summary analysis 

To get a comprehensive understanding of the challenges in extractive review summarization, we analyze 
the agreement in human summaries collected in our user study at different levels of granularity, regarding 
heuristics that arc widely used in existing extractive summarizers. 

Average word/sentence counts. Figure 4 illustrates the trend of average number of words and sentences 
shared by different number of users across review items for each domain. As it shows, no sentence is 

7 Because ROUGE requires all summaries to have equal length (word counts), we only consider the first 100 words in every 
summary. 
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agreed by over 10 users, which suggests that it is hard to make humans agree on the informativeness of 
review sentences. 


ft of shared words (Log 10 ) ft of shared sentences (Log 10 ) 




Figure 4: Average number of words (w) and sentences (s) in agreed Figure 5: Average probability of 
human summaries words used in human summaries 


Word frequency. We then compute the average probability of word (in the input) used by different 
number of human su mm arizes to see if the word frequency pattern found in news articles (words that 
human summarizes agreed to use in their summaries are of high frequency in the input text (Nenkova 
and Vanderwende, 2005)) holds for online reviews. Figure 5 confirms this. However, the average word 
probability is below 0.01 in those shared by 14 out of 18 summaries 8 ; the flatness of the curve seems to 
suggest that word frequency alone is not enough for capturing the salient information in input reviews. 
KL-divergence. Another widely used heuristic in multi-document summarization is minimizing the 
distance of unigram distribution between the summary and the input text (Lin et al., 2006). We wonder 
if this applies to online review summarization. For each testing set, we group review sentences by the 
number of users who selected them in their summaries, and compute the KL-divergence (KLD) between 
each sentence group and the input. The average KL-divergence of each group across review items are 
visualized in Figure 6, showing that this intuition is incorrect for our review domains. Actually, the 
pattern is quite the opposite, especially when the number of users who share the sentences is less than 8. 
Thus traditional methods that aim to minimize KL-divergence might not work well for online reviews. 



Figure 6: Average KL-Divergence between Figure 7: Average BigramSum of sentences 

input and sentences used in human summaries use d in human summaries 


Bigram coverage. Recent studies proposed a simple but effective criteria for extractive summarization 
based on bigram coverage (Nenkova and Vanderwende, 2005; Gillick and Favre, 2009). The coverage 
of a given bigram in a summary is defined as the number of input documents the bigram appears in, and 
presumably good summaries should have larger sum of bigram coverage (BigramSum). However, as 
shown in Figure 7, this criteria might not work well in our case either. For instance, the BigramSum of 
the sentences that are shared by 3 human judges is smaller than those shared by 1 or 2 judges. 


8 The average probability of words used by all 4 human summarizers are 0.01 across the 30 DUC03 sets (Nenkova and 
Vanderwende, 2005). 
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7 Conclusion and future work 

We propose a novel unsupervised extractive approach for summarizing online reviews by exploiting re¬ 
view helpfulness ratings for content selection. We demonstrate that the helpfulness metadata can not 
only be directly used for review-level filtering, but also be used as the supervision of sLDA for sentence 
scoring. This approach leverages the existing metadata of online reviews, requiring no annotation and 
generalizable to multiple review domains. Our experiment based on the MEAD framework shows that 
HelpfulFilter is preferred over the baseline (MEAD+LexRank) on camera reviews in human evaluation. 
HelpfulSum, which utilizes review helpfulness at both the review and sentence level, significantly out¬ 
performs the baseline in both human and automated evaluation. Our analysis on the collected human 
summaries reveals the limitation of traditional summarization heuristics (proposed for news articles) for 
being used in review domains. 

In this study, we consider the ground truth of review helpfulness as the percentage of helpful votes 
over all votes, where the helpfulness votes could be biased in various ways (Danescu-Niculescu-Mizil et 
al., 2009). In the future, we would like to explore more sophisticated models of review helpfulness to 
eliminate such biases, or even automatic review helpfulness predictions based on just review text. We also 
would like to build a fully automated summarize!' by replacing the review helpfulness gold-standard with 
automated predictions as the filtering criteria. Given the collected human summaries, we will experiment 
with different feature combinations for sentence scoring and we will compare our helpfulness features 
with other content features as well. Finally, we want to further analyze the impact of the number of 
human judges on our automated evaluation results based on ROUGE scores. 
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